So far, we have focused on regression involving linear coefficients for numeric independent variables. Today, we turn our attention to new classes of regression terms:
- "dummy" variables
- categorical variables
- interactions
2015-07-27
After class today you will be able to
We are going to be looking at the relationship between years of education and vocabulary. This data is originally from the GSS, and is obtained from John Fox's website.
data <- read.table("data/Vocabulary.txt", header=TRUE)
data <- tbl_df(data)
p <- ggplot(data, aes(x=education, y=vocabulary))
p + geom_jitter() + stat_smooth(method = "lm") + theme_bw()
p <- ggplot(data, aes(x=education))
p + geom_histogram(binwidth=1, colour="black", fill="dark grey", origin = -0.5) + scale_x_continuous(breaks=0:20) + theme_bw() + facet_grid(sex ~ .)
p <- ggplot(data, aes(x=vocabulary))
p + geom_histogram(binwidth=1, colour="black", fill="white", origin = -0.5) + scale_x_continuous(breaks=0:10) + facet_grid(sex ~ .) + theme_economist_white()
p <- ggplot(data, aes(x=education, y=vocabulary, colour=sex))
p + geom_jitter() + scale_colour_brewer(palette=2,type='qual') + theme_bw() + theme(legend.position = c(.9,.1))
fit <- lm(vocabulary ~ education + sex, data = data)
tidy(fit)

##          term   estimate   std.error statistic       p.value
## 1 (Intercept)  1.5080568 0.055641335 27.103174 4.086269e-159
## 2   education  0.3578652 0.004193481 85.338458  0.000000e+00
## 3     sexMale -0.2106902 0.025748042 -8.182766  2.925683e-16
What does it look like to draw this regression line?
(let's draw…)
p + geom_jitter(alpha = .1) + scale_colour_brewer(palette=2,type='qual') + theme_bw() + geom_smooth(aes(group=sex),method='lm',lty=2,size=2) + theme(legend.position = c(.9,.1))
ALWAYS RECODE BINARY VARIABLES SO THAT THE VALUES ARE 0/1
ALWAYS MAKE SURE THE CODING SCHEME CARRIES MNEMONIC SIGNIFICANCE SO THAT 1=YES AND 0=NO
YES THIS WARRANTS ALL-CAPS!
What is going on in this code?
data <- mutate(data, female = as.numeric(sex=="Female"))
head(data)

## Source: local data frame [6 x 6]
##
##         id  year    sex education vocabulary female
##      (int) (int) (fctr)     (int)      (int)  (dbl)
## 1 20040001  2004 Female         9          3      1
## 2 20040002  2004 Female        14          6      1
## 3 20040003  2004   Male        14          9      0
## 4 20040005  2004 Female        17          8      1
## 5 20040008  2004   Male        14          1      0
## 6 20040010  2004   Male        14          7      0
- `sex=='Female'` produces a TRUE/FALSE vector
- With `dplyr::mutate`, we make the variable `female` equal to 1 for females and 0 for males

Does recoding a dummy variable change a regression?
How or why?
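Before answering, it helps to see what `as.numeric` does to a logical vector. A minimal base-R check (standalone sketch, not from the original slides):

```r
# Comparing a character vector to "Female" yields TRUE/FALSE;
# as.numeric() converts TRUE to 1 and FALSE to 0
sex <- c("Female", "Male", "Male", "Female")
sex == "Female"              # TRUE FALSE FALSE TRUE
as.numeric(sex == "Female")  # 1 0 0 1
```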
How does our new model:
(fit.df <- tidy(lm(vocabulary ~ education + female, data = data)))
##          term  estimate   std.error statistic       p.value
## 1 (Intercept) 1.2973666 0.057841978 22.429499 3.656442e-110
## 2   education 0.3578652 0.004193481 85.338458  0.000000e+00
## 3      female 0.2106902 0.025748042  8.182766  2.925683e-16
compare to old model:
tidy(lm(vocabulary ~ education + sex, data = data))
##          term   estimate   std.error statistic       p.value
## 1 (Intercept)  1.5080568 0.055641335 27.103174 4.086269e-159
## 2   education  0.3578652 0.004193481 85.338458  0.000000e+00
## 3     sexMale -0.2106902 0.025748042 -8.182766  2.925683e-16
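We can check this equivalence directly on simulated data (a sketch with made-up coefficients, not the GSS data): recoding the dummy flips the sign of its coefficient and shifts the intercept, but the education slope and all fitted values are unchanged.

```r
set.seed(1)
n <- 200
female <- rbinom(n, 1, 0.5)
education <- sample(8:20, n, replace = TRUE)
vocabulary <- 1.3 + 0.36 * education + 0.21 * female + rnorm(n)
male <- 1 - female  # the opposite coding

fit.f <- lm(vocabulary ~ education + female)
fit.m <- lm(vocabulary ~ education + male)

# Same education slope under either coding (~0 difference)
coef(fit.f)["education"] - coef(fit.m)["education"]
# Dummy coefficients are equal in magnitude, opposite in sign (~0 sum)
coef(fit.f)["female"] + coef(fit.m)["male"]
```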
The concept to keep in mind is that fitting a dichotomous variable has the effect of fitting two different intercepts, one for each group
What we are testing is whether fitting a separate intercept adds sufficient explanatory power to warrant the additional complication
fit.df
##          term  estimate   std.error statistic       p.value
## 1 (Intercept) 1.2973666 0.057841978 22.429499 3.656442e-110
## 2   education 0.3578652 0.004193481 85.338458  0.000000e+00
## 3      female 0.2106902 0.025748042  8.182766  2.925683e-16
Is the separate intercept for females warranted here? Why or why not?
We are going to be looking at the relationship among average years of education, average income, occupation category, and occupational prestige. This data is from the 1971 Canadian Census and is obtained from John Fox's website.
data <- read.table("data/Prestige.txt", header=TRUE)
head(data)
##                     education income women prestige census type
## GOV.ADMINISTRATORS      13.11  12351 11.16     68.8   1113 prof
## GENERAL.MANAGERS        12.26  25879  4.02     69.1   1130 prof
## ACCOUNTANTS             12.77   9271 15.70     63.4   1171 prof
## PURCHASING.OFFICERS     11.42   8865  9.11     56.8   1175 prof
## CHEMISTS                14.62   8403 11.68     73.5   2111 prof
## PHYSICISTS              15.64  11030  5.13     77.6   2113 prof
data <- tbl_df(data)
head(data)

## Source: local data frame [6 x 6]
##
##   education income women prestige census   type
##       (dbl)  (int) (dbl)    (dbl)  (int) (fctr)
## 1     13.11  12351 11.16     68.8   1113   prof
## 2     12.26  25879  4.02     69.1   1130   prof
## 3     12.77   9271 15.70     63.4   1171   prof
## 4     11.42   8865  9.11     56.8   1175   prof
## 5     14.62   8403 11.68     73.5   2111   prof
## 6     15.64  11030  5.13     77.6   2113   prof
data <- read.table("data/Prestige.txt", header=TRUE)
data$occupation <- rownames(data)
rownames(data) <- NULL
data <- tbl_df(data)
data <- filter(data, !is.na(type))
head(data)
## Source: local data frame [6 x 7]
##
##   education income women prestige census   type          occupation
##       (dbl)  (int) (dbl)    (dbl)  (int) (fctr)               (chr)
## 1     13.11  12351 11.16     68.8   1113   prof  GOV.ADMINISTRATORS
## 2     12.26  25879  4.02     69.1   1130   prof    GENERAL.MANAGERS
## 3     12.77   9271 15.70     63.4   1171   prof         ACCOUNTANTS
## 4     11.42   8865  9.11     56.8   1175   prof PURCHASING.OFFICERS
## 5     14.62   8403 11.68     73.5   2111   prof            CHEMISTS
## 6     15.64  11030  5.13     77.6   2113   prof          PHYSICISTS
p <- ggplot(data, aes(x=education, y=prestige))
p + geom_jitter() + stat_smooth(method = "lm") + theme_bw()
p <- ggplot(data, aes(x=education, y=prestige))
p + geom_jitter() + theme_bw() + facet_grid(. ~ type) + stat_smooth(method = "lm")
fit <- lm(prestige ~ education + type, data = data)
fit.df <- tidy(fit)
print(fit.df)

##          term  estimate std.error  statistic      p.value
## 1 (Intercept) -2.698159  5.736093 -0.4703828 6.391712e-01
## 2   education  4.572793  0.671564  6.8091697 9.159877e-10
## 3    typeprof  6.142444  4.258961  1.4422401 1.525583e-01
## 4      typewc -5.458495  2.690667 -2.0286769 4.532001e-02
Let's sketch the graph of the relationship among occupational prestige, education, and job type implied by these results…
Any categorical variable can be represented by a series of dummy variables
Each group gets its own intercept
What are potential pitfalls or downsides to this (e.g., if \(k\) gets big)?
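Under the hood, R performs exactly this expansion: `model.matrix` turns a factor with \(k\) levels into \(k-1\) dummy columns, with the first level absorbed into the intercept as the reference category. A small base-R illustration (a sketch, not from the original slides):

```r
# A factor with three levels: "bc" (blue collar), "prof", "wc"
type <- factor(c("bc", "prof", "wc", "bc"))
model.matrix(~ type)
# Columns: (Intercept), typeprof, typewc -- "bc" is the reference level
```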
How often do we care about intercepts? What if we also want different slopes for different groups?
Back to education and vocabulary
data <- mutate(data, female = as.numeric(sex=="Female"))
fit <- lm(vocabulary ~ education + female + education:female, data = data)
fit.df <- tidy(fit)
print(fit.df)

##               term    estimate   std.error  statistic      p.value
## 1      (Intercept) 1.348928559 0.080013149 16.8588360 2.294345e-63
## 2        education 0.353897438 0.005973606 59.2435195 0.000000e+00
## 3           female 0.110384052 0.110587272  0.9981624 3.182118e-01
## 4 education:female 0.007823049 0.008387855  0.9326638 3.510040e-01
Sketch the graph of the relationship between vocabulary, education, and gender implied by this result.
Put simply, what did we do?
When you interact a dummy variable with a numeric variable, you are fitting a separate slope for each group represented by the dummy variable
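One way to see this: with the dummy and interaction terms included, the pooled model reproduces the slopes you would get by fitting each group separately. A sketch on simulated data (made-up coefficients, not the GSS data):

```r
set.seed(2)
n <- 300
female <- rbinom(n, 1, 0.5)
education <- sample(8:20, n, replace = TRUE)
vocabulary <- 1.3 + 0.35 * education + 0.11 * female +
  0.05 * education * female + rnorm(n)

fit.all <- lm(vocabulary ~ education * female)            # pooled, with interaction
fit.m   <- lm(vocabulary ~ education, subset = female == 0)
fit.f   <- lm(vocabulary ~ education, subset = female == 1)

# Male slope = education coefficient (~0 difference)
coef(fit.all)["education"] - coef(fit.m)["education"]
# Female slope = education + interaction coefficient (~0 difference)
(coef(fit.all)["education"] + coef(fit.all)["education:female"]) -
  coef(fit.f)["education"]
```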
We can look at how the terms cancel out (since female equals 0 or 1) to accomplish this:
\[ \hat{y_i} = \beta_0 + \beta_1X_{1[i]} + \beta_2X_{2[i]} + \beta_3X_{1[i]}X_{2[i]} \]
##      (Intercept)        education           female education:female
##      1.348928559      0.353897438      0.110384052      0.007823049
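Plugging in the two possible values of the dummy \(X_2\) makes the cancellation explicit:

\[ \hat{y_i} = \begin{cases} \beta_0 + \beta_1X_{1[i]} & X_{2[i]} = 0 \text{ (male)} \\ (\beta_0 + \beta_2) + (\beta_1 + \beta_3)X_{1[i]} & X_{2[i]} = 1 \text{ (female)} \end{cases} \]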
The modeling question becomes, "Does the explanatory power of having separate slopes outweigh the costs of the added complexity?"
# Recover group-specific intercepts and slopes from the interaction fit
inter.male   <- coef(fit)["(Intercept)"]
slope.male   <- coef(fit)["education"]
inter.female <- inter.male + coef(fit)["female"]
slope.female <- slope.male + coef(fit)["education:female"]
line.size <- 2
p <- ggplot(data, aes(x=education, y=vocabulary, colour=sex))
p + geom_jitter(alpha=0.7) +
scale_color_manual(values = c("dark green", "dark orange")) +
geom_abline(intercept = inter.male, slope = slope.male, colour="dark orange", size=line.size) +
geom_abline(intercept = inter.female, slope = slope.female, colour="dark green", size=line.size) +
expand_limits(x = 0, y = 0)
Back to occupational prestige…
data <- read.table("data/Prestige.txt", header=TRUE)
data$occupation <- rownames(data)
rownames(data) <- NULL
data <- tbl_df(data)
data <- filter(data, !is.na(type))
head(data)
## Source: local data frame [6 x 7]
##
##   education income women prestige census   type          occupation
##       (dbl)  (int) (dbl)    (dbl)  (int) (fctr)               (chr)
## 1     13.11  12351 11.16     68.8   1113   prof  GOV.ADMINISTRATORS
## 2     12.26  25879  4.02     69.1   1130   prof    GENERAL.MANAGERS
## 3     12.77   9271 15.70     63.4   1171   prof         ACCOUNTANTS
## 4     11.42   8865  9.11     56.8   1175   prof PURCHASING.OFFICERS
## 5     14.62   8403 11.68     73.5   2111   prof            CHEMISTS
## 6     15.64  11030  5.13     77.6   2113   prof          PHYSICISTS
data <- mutate(data, prof = as.numeric(type=="prof"))
data <- mutate(data, white.collar = as.numeric(type=="wc"))
fit <- lm(prestige ~ education + prof + white.collar + education:type, data = data)
fit.df <- tidy(fit)
print(fit.df)

##                 term   estimate std.error  statistic      p.value
## 1        (Intercept)  -4.293600  8.647020 -0.4965409 6.206972e-01
## 2          education   4.763651  1.024740  4.6486436 1.112981e-05
## 3               prof  18.863655 16.888126  1.1169774 2.669126e-01
## 4       white.collar -24.383279 21.777713 -1.1196437 2.657802e-01
## 5 education:typeprof  -0.980805  1.449480 -0.6766598 5.003196e-01
## 6   education:typewc   1.670938  2.077688  0.8042294 4.233378e-01
Let's sketch the graph showing the relationship between education and occupational prestige implied by this result…
suppressPackageStartupMessages(library(dplyr))
load("data/gss_2010_training.RData")
gss.training <- tbl_df(gss.training)
gss <- select(gss.training, income06_n, educ, maeduc, paeduc, sex) %>%
filter(!is.na(income06_n), !is.na(educ), !is.na(maeduc), !is.na(paeduc), !is.na(sex))
# NOTE: DROPPING MISSING DATA LIKE THIS CAN BE DANGEROUS
gss <- rename(gss, income = income06_n)
gss <- mutate(gss, female = as.numeric(sex=="female"))
Brambor et al. (2006) advise: include interaction terms whenever there is a conditional hypothesis
Hypothesis: The relationship between education and income is different for women and men.
or
Hypothesis: There is a relationship between education and income.
\[\widehat{income}_i = \beta_0 + \beta_1 educ_i + \beta_2 sex_i + \beta_3 sex_i * educ_i\]
What is "conditional"? The relationship between education and income is conditional on sex.
\[\widehat{income}_i = \beta_0 + \beta_1 educ_i + \beta_2 sex_i + \beta_3 sex_i * educ_i\]
fit <- lm(income ~ educ + female + educ:female, data = gss)
tidy(fit)

##          term   estimate  std.error statistic      p.value
## 1 (Intercept) 11.1928734 0.72610155 15.415025 6.483265e-51
## 2        educ  0.5761664 0.05043841 11.423167 2.243893e-29
## 3      female -3.8500659 1.01361238 -3.798361 1.497277e-04
## 4 educ:female  0.1961346 0.07047251  2.783135 5.431051e-03
Constitutive terms are the base terms that make up an interaction
If you are modeling \(y \sim x_1 + x_2 + x_1\times x_2\), \(x_1\) and \(x_2\) are constitutive terms
You might be tempted to just model \(y = x_1 \times x_2\) instead…
Let's say this is the model we specify…
fit <- lm(income ~ educ + educ:female, data = gss)
fit.df <- tidy(fit)
fit.df

##          term    estimate  std.error statistic      p.value
## 1 (Intercept)  9.21718217 0.50822134 18.136157 1.753675e-68
## 2        educ  0.71015251 0.03616492 19.636503 4.698687e-79
## 3 educ:female -0.06556639 0.01485827 -4.412788 1.071287e-05
How do we interpret this? (perhaps draw a picture…)
# Shared intercept; only the slopes differ by sex in this model
inter        <- coef(fit)["(Intercept)"]
slope.male   <- coef(fit)["educ"]
slope.female <- slope.male + coef(fit)["educ:female"]
p1 <- ggplot(gss, aes(x=educ, y=income, col=sex))
(p1 <- p1 + geom_jitter() + theme(legend.position=c(.8,.2))+
scale_color_manual(values = c("blue", "dark green")) +
geom_abline(intercept = inter, slope = slope.male, col="blue") +
geom_abline(intercept = inter, slope = slope.female, col="dark green"))
We don't want to require that men and women have the same predicted level of income at 0 years of schooling!
When you specify a model, think carefully about what you are saying!
fit <- lm(income ~ educ + female + educ:female, data = gss)
# Separate intercepts and slopes implied by the full interaction model
inter.male   <- coef(fit)["(Intercept)"]
slope.male   <- coef(fit)["educ"]
inter.female <- inter.male + coef(fit)["female"]
slope.female <- slope.male + coef(fit)["educ:female"]
p2 <- ggplot(gss, aes(x=educ, y=income, col=sex))
(p2 <- p2 + geom_jitter() + theme(legend.position=c(.8,.2))+
scale_color_manual(values = c("blue", "dark green")) +
geom_abline(intercept = inter.male, slope = slope.male, col="blue") +
geom_abline(intercept = inter.female, slope = slope.female, col="dark green"))
tidy(fit)
##          term   estimate  std.error statistic      p.value
## 1 (Intercept) 11.1928734 0.72610155 15.415025 6.483265e-51
## 2        educ  0.5761664 0.05043841 11.423167 2.243893e-29
## 3      female -3.8500659 1.01361238 -3.798361 1.497277e-04
## 4 educ:female  0.1961346 0.07047251  2.783135 5.431051e-03
If sex matters, then it would be strange to say people with one year of additional education are predicted to have 0.6 higher income (sorry that income is not measured in dollars).
The point of the model is that the difference in predicted value depends on whether the respondent is a man or a woman.
##          term   estimate  std.error statistic      p.value
## 1 (Intercept) 11.1928734 0.72610155 15.415025 6.483265e-51
## 2        educ  0.5761664 0.05043841 11.423167 2.243893e-29
## 3      female -3.8500659 1.01361238 -3.798361 1.497277e-04
## 4 educ:female  0.1961346 0.07047251  2.783135 5.431051e-03
The effect of education should be interpreted as:
0.576 + 0.196 * Female
It can be helpful to think about it like taking a partial derivative:
\[Income = \beta_0 + \beta_1 Educ + \beta_2 Female + \beta_3 Educ Female\]
\[\frac{dIncome}{dEduc} = \beta_1 + \beta_3 * Female\]
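Applying this to the estimates above, plugging the fitted coefficients into \(\beta_1 + \beta_3 \times Female\):

```r
# Coefficients from the fitted interaction model above
b1 <- 0.5761664   # educ
b3 <- 0.1961346   # educ:female

b1 + b3 * 0   # marginal effect of education for men:   0.5761664
b1 + b3 * 1   # marginal effect of education for women: 0.7723010
```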
fit <- lm(income ~ educ + maeduc + educ:maeduc, data = gss)
tidy(fit)

##          term    estimate   std.error  statistic      p.value
## 1 (Intercept)  8.44645152 1.092420927  7.7318654 1.620658e-14
## 2        educ  0.65796442 0.084565682  7.7805134 1.116219e-14
## 3      maeduc  0.12757456 0.102105208  1.2494423 2.116404e-01
## 4 educ:maeduc -0.00288615 0.007258511 -0.3976228 6.909480e-01
More importantly, what if the interaction is with a numeric variable?
For this model, the educ coefficient is the predicted effect of 1 year of education when mother's education (maeduc) is 0!
"One cannot determine whether a model should include an interaction term simply by looking at the significance of the coefficient on the interaction term." - Brambor et al. 2006, p. 74
In most regression tables, we only see the effect of a constitutive term at the point where the variable it interacts with equals 0
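One common remedy, sketched here on simulated data (made-up coefficients, not the GSS data), is to center the moderating variable: the educ coefficient then becomes the marginal effect at the mean of maeduc rather than at maeduc = 0.

```r
set.seed(3)
n <- 300
educ   <- sample(8:20, n, replace = TRUE)
maeduc <- sample(0:16, n, replace = TRUE)
income <- 8 + 0.6 * educ + 0.1 * maeduc + 0.05 * educ * maeduc + rnorm(n)

fit.raw <- lm(income ~ educ * maeduc)
fit.ctr <- lm(income ~ educ * I(maeduc - mean(maeduc)))

# The centered model's educ coefficient equals the raw model's
# marginal effect of educ evaluated at mean(maeduc)
coef(fit.raw)["educ"] + coef(fit.raw)["educ:maeduc"] * mean(maeduc)
coef(fit.ctr)["educ"]
```

Centering is a pure reparameterization: fitted values and the interaction coefficient itself are unchanged; only the interpretation of the constitutive terms shifts.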
We will return to this in our discussion section…
\[\begin{eqnarray} log(inc_{i}) &=& \beta_0 + \nonumber \\ & & \beta_1 Chinese_i + \beta_2 Asian Indian + . . . + \beta_{18} Native American + \nonumber \\ & & \beta_{19} female_i + \nonumber \\ & & \beta_{20} Chinese_i * female_i + \beta_{21} Asian Indian * female_i + . . . + \nonumber \\ & & \epsilon_i \end{eqnarray}\]
Why are there 18 race/ethnicity dummy variables and 19 race/ethnicity groups studied?
Model for white man:
\[\begin{eqnarray} log(inc_{i}) &=& \beta_0 + \nonumber \\ & & \epsilon_i \end{eqnarray}\]
Model for white female:
\[\begin{eqnarray} log(inc_{i}) &=& \beta_0 + \nonumber \\ & & \beta_{19} female_i + \nonumber \\ & & \epsilon_i \end{eqnarray}\]
Model for Chinese-American man:
\[\begin{eqnarray} log(inc_{i}) &=& \beta_0 + \nonumber \\ & & \beta_1 Chinese_i + \nonumber \\ & & \epsilon_i \end{eqnarray}\]
Model for Chinese-American female:
\[\begin{eqnarray} log(inc_{i}) &=& \beta_0 + \nonumber \\ & & \beta_1 Chinese_i + \nonumber \\ & & \beta_{19} female_i + \nonumber \\ & & \beta_{20} Chinese_i * female_i + \nonumber \\ & & \epsilon_i \end{eqnarray}\]
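Subtracting the male model from the female model within each group shows what the interaction coefficient is doing: the gender gap is \(\beta_{19}\) for whites but \(\beta_{19} + \beta_{20}\) for Chinese-Americans, so \(\beta_{20}\) is a difference in differences:

\[ \underbrace{(\beta_{19} + \beta_{20})}_{\text{gap, Chinese-American}} - \underbrace{\beta_{19}}_{\text{gap, white}} = \beta_{20} \]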
(a little difficult to make sense of…)
How would this be visible in residuals of model with no interaction?
\[\begin{eqnarray} log(inc_{i}) &=& \beta_0 + \nonumber \\ & & \beta_1 Chinese_i + \beta_2 Asian Indian + . . . + \beta_{18} Native American + \nonumber \\ & & \beta_{19} female_i + \nonumber \\ & & \epsilon_i \end{eqnarray}\]
Hint: Here's the full model \[\begin{eqnarray} log(inc_{i}) &=& \beta_0 + \nonumber \\ & & \beta_1 Chinese_i + \beta_2 Asian Indian + . . . + \beta_{18} Native American + \nonumber \\ & & \beta_{19} female_i + \nonumber \\ & & \beta_{20} Chinese_i * female_i + \beta_{21} Asian Indian * female_i + . . . + \nonumber \\ & & \epsilon_i \end{eqnarray}\]
How much does this increase your confidence in this result?
How must their code be organized in order to be able to run these checks reliably?
Image by Roger Peng
Questions?
Goal check
Motivation for next class
sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.1 (El Capitan)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] broom_0.3.7 ggthemes_2.2.1 ggplot2_1.0.1 dplyr_0.4.3
##
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.1        knitr_1.11      magrittr_1.5
##  [4] MASS_7.3-44        mnormt_1.5-3    munsell_0.4.2
##  [7] colorspace_1.2-6   R6_2.1.1        stringr_1.0.0
## [10] plyr_1.8.3         tools_3.2.2     parallel_3.2.2
## [13] grid_3.2.2         gtable_0.1.2    psych_1.5.8
## [16] DBI_0.3.1          htmltools_0.2.6 lazyeval_0.1.10
## [19] yaml_2.1.13        assertthat_0.1  digest_0.6.8
## [22] RColorBrewer_1.1-2 tidyr_0.3.1     reshape2_1.4.1
## [25] formatR_1.2.1      evaluate_0.8    rmarkdown_0.8
## [28] labeling_0.3       stringi_0.5-5   scales_0.3.0
## [31] proto_0.3-10